Multiple regression is one of the most popular methods in
statistics as well as in machine learning. We use linear models as
working models for their simplicity and interpretability. It is important
to use domain knowledge as much as we can to determine the form
of the response as well as the functional form of the factors. Then,
when there are many possible features to include in the working model,
we inevitably need to choose the best possible model with a
sensible criterion. Cp, BIC and
regularization methods such as LASSO are introduced. Be aware that if model
selection is done, formally or informally, the inferences obtained from
the final lm() fit may not be valid. Some adjustment will
be needed. This last step is beyond the scope of this class. Check the
current research line that Linda and collaborators are working on.
This homework consists of two parts. The first is an exercise
(it will feel like a toy example after the covid case study) to get
familiar with model selection criteria such as Cp and
BIC. The main job is a rather involved case study about the
devastating covid19 pandemic. Please read through the case study first.
This project is certainly a great one to list on your CV.
For the covid case study, most of the time and effort will go into the EDA portion.
Model building process
Methods
Understand the criteria
Cp
BIC
K fold Cross Validation
LASSO
Packages
lm(), Anova
regsubsets()
glmnet() & cv.glmnet()
Review the code and concepts covered during lectures: multiple regression, model selection and penalized regression through elastic net.
If you haven’t done this as part of homework 2, please attach it here.
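As a quick refresher on two of the criteria listed above, the sketch below computes Mallows' Cp and BIC by hand for a few nested candidate models. It is only an illustration: it uses the built-in mtcars data as a stand-in for the homework data, and the candidate formulas and the helper name `cp_bic` are made up for this example.

```r
# Stand-in data: mtcars ships with base R (it is not the homework data).
dat <- mtcars

# Hypothetical candidate models, smallest to largest.
candidates <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ wt + hp + qsec + drat
)

# Error variance estimated from the largest candidate model.
full <- lm(candidates[[3]], data = dat)
n <- nrow(dat)
sigma2_full <- sum(resid(full)^2) / full$df.residual

# Illustrative helper: Cp = RSS / sigma2_full + 2p - n, BIC via stats::BIC().
cp_bic <- function(formula) {
  fit <- lm(formula, data = dat)
  p <- length(coef(fit))  # number of coefficients, intercept included
  rss <- sum(resid(fit)^2)
  c(p = p, Cp = rss / sigma2_full + 2 * p - n, BIC = BIC(fit))
}

criteria <- t(sapply(candidates, cp_bic))
print(round(criteria, 2))
```

Smaller values are better for both criteria. For the largest model, Cp equals its number of coefficients by construction, so Cp is mainly informative for the smaller candidates.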
ISLR::Auto data
This will be the last part of the Auto data from ISLR. The original
data contains 408 observations about cars. It is similar to the
Cars data that we use in our lectures. To get the data, first install
the package ISLR. The data set Auto should be
loaded automatically. We use this case to go through the methods learned so
far.
Final modelling question: We want to explore the effects of each feature as best as possible.
| Statistic | mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|---|
| Min. | 9.0 | 3.00 | 68 | 46.0 | 1613 | 8.0 |
| 1st Qu. | 17.0 | 4.00 | 105 | 75.0 | 2225 | 13.8 |
| Median | 22.8 | 4.00 | 151 | 93.5 | 2804 | 15.5 |
| Mean | 23.4 | 5.47 | 194 | 104.5 | 2978 | 15.5 |
| 3rd Qu. | 29.0 | 8.00 | 276 | 126.0 | 3615 | 17.0 |
| Max. | 46.6 | 8.00 | 455 | 230.0 | 5140 | 24.8 |
year summary
total 13 years
from 1970-1982
Origin of car
American: 245
European: 68
Japanese: 79
Auto names
unique auto names: 301
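Summaries like the ones above can be produced along the following lines. This is a sketch on the built-in mtcars data as a stand-in; with the homework data you would point `dat` at ISLR::Auto and choose its columns instead.

```r
dat <- mtcars  # stand-in for the Auto data

# Min, quartiles, median and mean for a few numeric columns, one column per variable.
num_summary <- sapply(dat[, c("mpg", "cyl", "disp", "hp", "wt")], summary)
print(round(num_summary, 2))

# Counts per level for a categorical column (here: number of cylinders).
cyl_counts <- table(dat$cyl)
print(cyl_counts)

# Number of distinct car names (stored as row names in mtcars).
n_names <- length(unique(rownames(dat)))
print(n_names)
```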
What effect does time have on MPG?
Start with a simple regression of mpg
vs. year and report R’s summary output. Is
year a significant variable at the .05 level? State what
effect year has on mpg, if any, according to
this model.
| | Dependent variable: mpg |
|---|---|
| year | 1.230*** (1.060, 1.400) |
| Constant | -70.000*** (-83.000, -57.000) |
| Observations | 392 |
| R2 | 0.337 |
| Adjusted R2 | 0.335 |
| Residual Std. Error | 6.360 (df = 390) |
| F Statistic | 198.000*** (df = 1; 390) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Year is significant at the 0.01 level, and therefore also at the .05 level. According to this model, each additional model year is associated with an average increase of about 1.230 in the mpg of a car.
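The fit and the significance check can be sketched as follows. The data here are simulated so that the block runs on its own; the variable names mirror the Auto data, but the data-generating numbers are made up and the output is not the homework result.

```r
set.seed(1)
n <- 200
sim <- data.frame(year = sample(70:82, n, replace = TRUE))
sim$mpg <- -70 + 1.2 * sim$year + rnorm(n, sd = 6)  # made-up data-generating model

fit_year <- lm(mpg ~ year, data = sim)
coefs <- summary(fit_year)$coefficients  # estimate, std. error, t value, p-value
print(coefs)

# p-value and 95% confidence interval for the year slope.
p_year <- coefs["year", "Pr(>|t|)"]
ci_year <- confint(fit_year, "year", level = 0.95)
print(ci_year)
cat("year significant at .05:", p_year < 0.05, "\n")
```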
Add horsepower on top of the variable year
to your linear model. Is year still a significant variable
at the .05 level? Give a precise interpretation of the
year’s effect found here. (Table 4)
| | Dependent variable: mpg |
|---|---|
| year | 0.657*** (0.527, 0.787) |
| horsepower | -0.132*** (-0.144, -0.119) |
| Constant | -12.700** (-23.200, -2.250) |
| Observations | 392 |
| R2 | 0.685 |
| Adjusted R2 | 0.684 |
| Residual Std. Error | 4.390 (df = 389) |
| F Statistic | 424.000*** (df = 2; 389) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Year is still significant at the 0.01 level. Holding horsepower fixed, each additional model year is associated with an average increase of about 0.657 in the mpg of a car. This effect is smaller than in the previous model because horsepower has been added to the model. (Table 5)
The confidence interval for the year coefficient got narrower going from (i) to
(ii). Adding more information to the model
(horsepower) explains some of the variability that is left unexplained
when we examine year alone. The narrower confidence interval
means that the year effect is estimated more precisely.
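The narrowing of the interval can be illustrated on simulated data. In this sketch the data-generating model is invented (horsepower is drawn independently of year and has a strong effect on mpg), so it shows the mechanism rather than reproducing the homework numbers.

```r
set.seed(2)
n <- 300
sim <- data.frame(
  year = sample(70:82, n, replace = TRUE),
  horsepower = round(runif(n, 50, 200))
)
# Made-up data-generating model with a strong horsepower effect.
sim$mpg <- -12 + 0.65 * sim$year - 0.13 * sim$horsepower + rnorm(n, sd = 4)

fit1 <- lm(mpg ~ year, data = sim)
fit2 <- lm(mpg ~ year + horsepower, data = sim)

# Width of the 95% confidence interval for the year coefficient.
ci_width <- function(fit) unname(diff(confint(fit, "year")[1, ]))
widths <- c(year_only = ci_width(fit1), with_horsepower = ci_width(fit2))
print(round(widths, 3))

# Residual standard errors: the second model leaves less unexplained noise.
print(c(sigma1 = sigma(fit1), sigma2 = sigma(fit2)))
```

If horsepower were strongly correlated with year, the added collinearity could offset part of this gain, so the interval is not guaranteed to shrink in every data set.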
Fit a model with an interaction: lm(mpg ~ year * horsepower). Is the interaction effect
significant at .05 level? Explain the year effect (if any).
| | Dependent variable: mpg |
|---|---|
| year | 2.190*** (1.880, 2.510) |
| horsepower | 1.050*** (0.820, 1.270) |
| year:horsepower | -0.016*** (-0.019, -0.013) |
| Constant | -127.000*** (-150.000, -103.000) |
| Observations | 392 |
| R2 | 0.752 |
| Adjusted R2 | 0.750 |
| Residual Std. Error | 3.900 (df = 388) |
| F Statistic | 393.000*** (df = 3; 388) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
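All of the terms, including the interaction, are significant at the 0.01 level. With the interaction in the model, the year effect is no longer a single number: each additional year changes mpg by about 2.190 - 0.016 × horsepower on average. For a car with 100 horsepower that is an increase of about 0.59 mpg per year, and the gain shrinks as horsepower grows. The 2.190 coefficient on its own is the year effect at zero horsepower, so it should not be compared directly with the year coefficients of the previous models. (Table 6)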
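With an interaction term, the change in mpg per additional year depends on horsepower: it is the year coefficient plus the interaction coefficient times horsepower. The sketch below evaluates this with the rounded coefficients from the table; the horsepower values are chosen only for illustration, and the rounding makes the results approximate.

```r
# Coefficients copied from the interaction table (rounded as reported).
b_year <- 2.190
b_int  <- -0.016

# Estimated change in mpg per additional year at a given horsepower.
year_effect <- function(horsepower) b_year + b_int * horsepower

hp_values <- c(50, 100, 150, 200)  # illustrative horsepower values
effects <- year_effect(hp_values)
print(data.frame(horsepower = hp_values, year_effect = effects))

# Horsepower at which the estimated year effect reaches zero.
hp_zero <- -b_year / b_int
print(hp_zero)
```

With these rounded coefficients the estimated year effect is positive for low-horsepower cars and changes sign at roughly 137 horsepower.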
Remember that the same variable can play different roles! Take a
quick look at the variable cylinders, and try to use this
variable in the following analyses wisely. We all agree that a larger
number of cylinders will lower mpg. However, we can interpret
cylinders as either a continuous (numeric) variable or a
categorical variable.
Fit a model that treats cylinders as a
continuous/numeric variable. Is cylinders significant at
the 0.01 level? What effect does cylinders play in this
model?
| | Dependent variable: mpg |
|---|---|
| cylinders | -3.560*** (-3.840, -3.270) |
| Constant | 42.900*** (41.300, 44.600) |
| Observations | 392 |
| R2 | 0.605 |
| Adjusted R2 | 0.604 |
| Residual Std. Error | 4.910 (df = 390) |
| F Statistic | 597.000*** (df = 1; 390) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Cylinders is significant at the 0.01 level. According to this model, each additional cylinder is associated with an average decrease of about 3.560 in the mpg of a car. (Table 7)
Fit a model that treats cylinders as a
categorical/factor. Is cylinders significant at the .01
level? What is the effect of cylinders in this model?
Describe the cylinders effect over mpg.
| | Dependent variable: mpg |
|---|---|
| factor(cylinders)4 | 8.730*** (4.080, 13.400) |
| factor(cylinders)5 | 6.820* (-0.217, 13.900) |
| factor(cylinders)6 | -0.577 (-5.290, 4.140) |
| factor(cylinders)8 | -5.590** (-10.300, -0.894) |
| Constant | 20.600*** (15.900, 25.200) |
| Observations | 392 |
| R2 | 0.641 |
| Adjusted R2 | 0.638 |
| Residual Std. Error | 4.700 (df = 387) |
| F Statistic | 173.000*** (df = 4; 387) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
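As a factor, cylinders is significant at the 0.01 level overall (F = 173 on 4 and 387 degrees of freedom). Each coefficient compares one cylinder count with the 3-cylinder baseline, whose mean mpg is 20.600. Cars with 4 cylinders average 8.730 mpg more (the only level significant at the 0.01 level), cars with 5 cylinders 6.820 more (significant only at the 0.1 level), cars with 6 cylinders 0.577 less (not significant), and cars with 8 cylinders 5.590 less (significant at the 0.05 level). The effect is therefore not a constant change per added cylinder. (Table 7)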
What is the difference between treating cylinders as a continuous and a categorical variable in your
models?
Treating cylinders as a continuous variable forces a straight-line relationship: every added cylinder is assumed to change mpg by the same amount, which does not match what the factor model shows. Treating cylinders as a categorical variable gives each cylinder count its own mean mpg, so you can see that the relationship between the number of cylinders and mpg is not linear.
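The two codings can be compared side by side. The sketch below uses the built-in mtcars data as a stand-in (its `cyl` column takes the values 4, 6 and 8), so the numbers are not those of the Auto data.

```r
dat <- mtcars  # stand-in data; cyl takes the values 4, 6 and 8

fit_num <- lm(mpg ~ cyl, data = dat)          # one slope: same change per cylinder
fit_fac <- lm(mpg ~ factor(cyl), data = dat)  # one mean per cylinder count

print(length(coef(fit_num)))  # intercept + slope
print(length(coef(fit_fac)))  # baseline mean + one offset per other level

# Fitted mpg for each cylinder count under both codings.
new_cars <- data.frame(cyl = c(4, 6, 8))
pred_num <- predict(fit_num, new_cars)
pred_fac <- predict(fit_fac, new_cars)
group_means <- tapply(dat$mpg, dat$cyl, mean)
print(rbind(numeric = pred_num, factor = pred_fac, group_mean = group_means))
```

The factor model reproduces the group means exactly, while the numeric model has to place the three fitted values on one straight line.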
Can you test the null hypothesis fit0: mpg is linear
in cylinders vs. fit1: mpg relates to
cylinders as a categorical variable at .01 level?
Yes, using anova(fit0, fit1). There is strong evidence against
the null hypothesis fit0 (mpg is linear in
cylinders) in favor of fit1 (mpg relates to
cylinders as a categorical variable): F = 13.2 with a p-value of 3.4e-08, well below .01.
## Analysis of Variance Table
##
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ factor(cylinders)
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)
## 1    390 9416
## 2    387 8544  3       871 13.2 3.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
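The F statistic in the table can be checked by hand from the two residual sums of squares. The sketch below uses only the rounded values printed in the table, so the result matches up to rounding.

```r
# Values copied from the ANOVA table.
rss0 <- 9416; df0 <- 390  # fit0: mpg ~ cylinders
rss1 <- 8544; df1 <- 387  # fit1: mpg ~ factor(cylinders)

# F = (drop in RSS per extra parameter) / (residual variance of the larger model).
f_stat <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_value <- pf(f_stat, df0 - df1, df1, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value)
cat("reject fit0 at the .01 level:", p_value < 0.01, "\n")
```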
Consider GPM = 1/MPG as the response. Compare residual plots of MPG or GPM as
responses and see which one might yield a more satisfactory
pattern.
In addition, can you provide some background knowledge to support the
notion: it makes more sense to model GPM?
GPM is the number of gallons needed to move a vehicle one mile; it is often quoted per 100 miles. The gpm helps determine the vehicle’s fuel economy when taking other considerations into account. Both MPG and GPM have a useful role at different points in owning a car. MPG is useful when you’re driving a car. GPM is useful when you’re purchasing a car, because it better captures the fuel consumption, and fuel savings, when comparing a current car to a new car, or when comparing two new cars to each other (ref).
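A small calculation shows why fuel savings are easier to read on the GPM scale. The mpg values below are hypothetical and chosen only to make the arithmetic clear.

```r
# Gallons needed per 100 miles for a given mpg.
gallons_per_100 <- function(mpg) 100 / mpg

mpg_values <- c(10, 20, 40)  # hypothetical cars
gp100 <- gallons_per_100(mpg_values)
print(data.frame(mpg = mpg_values, gallons_per_100_miles = gp100))

# Fuel saved per 100 miles by each upgrade.
savings <- -diff(gp100)
names(savings) <- c("10 -> 20 mpg", "20 -> 40 mpg")
print(savings)
```

Doubling from 10 to 20 mpg saves twice as much fuel per 100 miles as doubling from 20 to 40 mpg, even though the second step is a larger change in mpg.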
Exploring the interaction between horsepower and year
From the boxplot, we can see that Japanese cars tend to have noticeably higher mpg than American and European cars.
| | Dependent variable: mpg |
|---|---|
| year | 0.757*** (0.660, 0.854) |
| weight | -0.007*** (-0.007, -0.006) |
| Constant | -14.300*** (-22.200, -6.500) |
| Observations | 392 |
| R2 | 0.808 |
| Adjusted R2 | 0.807 |
| Residual Std. Error | 3.430 (df = 389) |
| F Statistic | 819.000*** (df = 2; 389) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |

| | Dependent variable: 1/mpg |
|---|---|
| year | -0.001*** (-0.001, -0.001) |
| horsepower | 0.0001*** (0.00005, 0.0001) |
| weight | 0.00001*** (0.00001, 0.00001) |
| Constant | 0.100*** (0.086, 0.113) |
| Observations | 392 |
| R2 | 0.881 |
| Adjusted R2 | 0.880 |
| Residual Std. Error | 0.006 (df = 388) |
| F Statistic | 957.000*** (df = 3; 388) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
The residuals for MPG are not as evenly spread vertically, contain outliers, and show more of a pattern than those for GPM. In the mpg residual plot, one may worry about a violation of linearity as well as the presence of heteroscedasticity. There is more room for improvement in the mpg model.
The gpm increases with the following changes in the features:
+ year decreases
+ weight increases
+ horsepower increases
Predict the mpg of a car that is: built in 1983, in the
US, red, 180 inches long, 8 cylinders, 350 displacement, 260 as
horsepower, and weighs 4,000 pounds. Give a 95% CI.
From these specifications we predict that the mpg will be 22 with a CI of [15.2, 28.8] using the mpg model.
However, since the GPM model fit better, we also predict with it: from these specifications the mpg will be 15.6 with a CI of [13.1, 19.2] using the GPM model.
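Predicting from a GPM model and converting back to mpg can be sketched as below. This uses the built-in mtcars data as a stand-in, with a hypothetical new car, and it requests a prediction interval; the write-up does not say whether its interval is a confidence or a prediction interval, so that choice is an assumption.

```r
dat <- mtcars            # stand-in data; wt is in units of 1000 lbs
dat$gpm <- 1 / dat$mpg

fit_gpm <- lm(gpm ~ wt + hp, data = dat)
new_car <- data.frame(wt = 4, hp = 260)  # hypothetical car

# 95% prediction interval on the GPM scale.
pred_gpm <- predict(fit_gpm, new_car, interval = "prediction", level = 0.95)
print(pred_gpm)

# Back to mpg: taking reciprocals reverses the order of the bounds,
# so the upper GPM bound becomes the lower mpg bound.
pred_mpg <- setNames(1 / pred_gpm[1, c("fit", "upr", "lwr")],
                     c("fit", "lwr", "upr"))
print(round(pred_mpg, 1))
```

The back-transformed interval is not symmetric around the point prediction, which is expected after a reciprocal transformation.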
If we could clean the data to take the manufacturer into account, that would improve the study.
The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 755 million cases have been confirmed worldwide, with nearly 6.8 million associated deaths by February 2023. Within the US alone, there have been over 1.1 million deaths and upwards of 102 million cases reported by February 2023. Governments around the world have implemented and suggested a number of policies to lessen the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.
The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of the segment. In particular, differing rates of infection and fatalities have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population that die due to the disease.
We assembled this dataset for our research with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.
There are two main goals for this case study.
Remark 1: The data and the statistics reported here were collected before February of 2021.
Remark 2: A group of RAs spent a tremendous amount of time working together to assemble the data. It requires data wrangling skills.
Remark 3: Please keep track of the most updated version of this write-up.
The data comes from several different sources:
In this case study, we use the following three nearly cleaned data sets:
Among all data, the unique identifier of county is
FIPS.
The cleaning procedure is attached in
Appendix 2: Data cleaning. You may go through it if you are
interested or would like to make any changes.
The data may need more wrangling.
First read in the data.
library(data.table)  # provides fread()
# county-level socioeconomic information
county_data <- fread("data/covid_county.csv")
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates
# covid_intervention <- fread("data/covid_intervention.csv")
The detailed description of variables is in
Appendix 1: Data description. Please get familiar with the
variables. Summarize the two data sets briefly.
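Because FIPS identifies a county in every table, the tables can be joined on it, and a death rate (deaths divided by total population, as defined earlier) can be computed after the join. The sketch below uses two toy data frames; every column name other than FIPS and all of the numbers are hypothetical, so check Appendix 1 for the real variable names.

```r
# Toy tables; every column except FIPS is a hypothetical name.
county_toy <- data.frame(
  FIPS = c(1001, 1003, 1005),
  population = c(50000, 200000, 25000)
)
covid_toy <- data.frame(
  FIPS = c(1001, 1003, 1005),
  cum_deaths = c(50, 300, 40)
)

# Join on the county identifier, then compute deaths per person.
merged <- merge(county_toy, covid_toy, by = "FIPS")
merged$death_rate <- merged$cum_deaths / merged$population
print(merged)
```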
Race Distribution across US counties
Age Distribution across US counties
Education Distribution across US counties
Employment Distribution across US counties
Income Distribution across US counties
Unemployment Distribution across US counties